Probabilistic Network Models for Word Sense Disambiguation

Authors

  • Gerald Chao
  • Michael G. Dyer
Abstract

We present the techniques used in the word sense disambiguation (WSD) system that was submitted to the SENSEVAL-2 workshop. The system builds a probabilistic network per sentence to model the dependencies between the words within the sentence, and the sense tagging for the entire sentence is computed by performing a query over the network. The salient context used for disambiguation is based on sentential structure rather than positional information. The parameters are established automatically and smoothed from training data, which was compiled from the SemCor corpus and the WordNet glosses. Lastly, the one-sense-per-discourse (OSPD) hypothesis is incorporated to test its effectiveness. The results from two parameterization techniques and the effects of the OSPD hypothesis are presented.

1 Problem Formulation

WSD is treated in this system as a classification task, where the i-th sense (W#i) of a word (W) is classified as the correct sense tag (M_i), given the word W and usually some surrounding context. In the SENSEVAL-2 English all-words task, all ambiguous content words (nouns, verbs, adjectives, and adverbs) are to be classified with a sense tag from the WordNet 1.7 lexical database (Miller, 1990). For example, the words "great", "devastated", and "region" in the sentence "The great hurricane devastated the region" are classified with the correct sense tags 2, 2, and 2, respectively. We will refer to this task using the following notation:

    M̂ = M_best(S) = argmax_M P(M | S),    (1)

where S is the input sentence and M is the assignment of a semantic tag to each word. While a context larger than the sentence S can be and is used in our model, we will refer to the context as S. In this formulation, each word W_i in the sentence is treated as a random variable M_i taking on the values {1 .. N_i}, where N_i is the number of senses of the word W_i. Therefore, we wish to find the instantiation of M such that P(M | S) is maximized.

To make the computation of M_best(S) more tractable, it can be decomposed into M_best(S) ≈ argmax ∏_i P(M_i | S), where it is assumed that each word can be disambiguated independently. However, this assumption does not always hold, since disambiguating one word often affects the sense assignment of another word within the same sentence. Alternatively, the process can be modeled as a Markov model, e.g., M_best(S) ≈ argmax ∏_i P(W_i | M_i) × P(M_i | M_{i-1}). While the Markov model requires fewer parameters, it is unable to capture the long-distance dependencies that occur in natural languages. Although the first decomposition better captures these dependencies, computing P(M_i | S) using the full sentential context is rarely done, since the number of parameters required grows exponentially with each added context word. Therefore, one can further simplify this model by narrowing the context to 2n surrounding words, i.e., P(M_i | S) ≈ P(M_i | W_{i-n}, ..., W_{i-1}, W_{i+1}, ..., W_{i+n}). However, narrowing the context also discards long-distance relationships, making this closer to a Markov model. Without having to artificially limit the size of the context, another possible simplification is to make independence assumptions between the context words. In the simplest case, every context word is assumed to be independent of every other, i.e., P(M_i | S) ≈ ∏_x P(M_i | W_x), as in a Naive Bayes classifier. While the parameters can then be established simply from a set of bigrams, the independence assumption is often too strong and thus hurts accuracy.
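To make the last decomposition concrete, the minimal sketch below disambiguates a single word with a Naive-Bayes-style score ∏_x P(M_i | W_x). The sense inventory and the probability table are hypothetical placeholders, not parameters estimated from SemCor or the WordNet glosses.

```python
import math

# Hypothetical sense inventory and bigram-style tables P(M_i | W_x);
# the numbers are illustrative placeholders only.
SENSES = {"great": (1, 2)}
P_SENSE_GIVEN_CONTEXT = {
    ("great", 2, "hurricane"): 0.7, ("great", 1, "hurricane"): 0.3,
    ("great", 2, "devastated"): 0.6, ("great", 1, "devastated"): 0.4,
}

def naive_bayes_sense(word, context_words, default=0.5):
    """argmax_s prod_x P(M=s | W_x): each word is disambiguated on its own,
    and every context word is treated as independent of the others."""
    best_sense, best_score = None, -math.inf
    for s in SENSES[word]:
        score = sum(math.log(P_SENSE_GIVEN_CONTEXT.get((word, s, c), default))
                    for c in context_words)
        if score > best_score:
            best_sense, best_score = s, score
    return best_sense

print(naive_bayes_sense("great", ["hurricane", "devastated"]))  # -> 2
```

Such a classifier is easy to parameterize but, as noted above, the full-independence assumption over context words is what the network model described next is designed to relax.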
The difficulty is in choosing the context that maximizes accuracy while still allowing reliable parameter estimation from training data. In our model, we aim to strike this balance by choosing the context words based on structural information rather than positional information. The hypothesis is that an ambiguous word is probabilistically dependent on its structurally related words and is independent of the rest of the sentence. Therefore, long-distance dependencies can still be captured while the context is kept small. Furthermore, each word is not classified independently of the others; instead, a single query determines all of the sense assignments that result in the highest overall probability for the whole sentence. Our model is therefore a combination of the decompositions described above: it selectively makes independence assumptions on a per-word basis to best model P(M_i | S), while computing M_best(S) in one query to allow for interactions between the word senses M_i.

1.1 Bayesian Networks

This process is achieved by using Bayesian networks to model the dependencies between each word and its contextual words and, based on the parameterization, to compute the best overall sense assignments. A Bayesian network is a directed acyclic graph G that represents a joint probability distribution P(X_1, ..., X_n) across the random variables of each node in the graph. By making independence assumptions between variables, each node i is conditionally dependent upon only its parents PA_i (Pearl, 1988):

    P(X_1, ..., X_n) = ∏_i P(X_i | PA_i).

By using this representation, the number of probabilities needed to represent the distribution can be significantly reduced. Figure 1 shows an example Bayesian network representing the distribution P(A, B, C, D, E, F).

    P(A, B, C, D, E, F) = P(A | B, C) × P(B | D, F) × P(C | D) × P(D | E) × P(E) × P(F)
    Figure 1: An example of a Bayesian network and the probability tables at each node that define the relationships between a node and its parents. The equation shows how the distribution is represented by the network.

Instead of having one large table with 2^6 parameters (with all Boolean nodes), the distribution is represented by the conditional probability tables (CPTs) at each node, such as P(B | D, F) at node B, requiring a total of only 24 parameters for the whole distribution. Not only do the savings become more significant with larger networks, but the sparse data problem also becomes more manageable. The training set no longer needs to cover all permutations of the feature sets, but only the smaller subsets dictated by the sets of variables in the CPTs.

In our model using Bayesian networks for WSD, each word is represented by the random variable M_i as a node in G. We then find a set of parents PA_i that M_i depends on, based on structural information. Using this representation, the number of parameters is significantly reduced. If the average number of parents per node is 2, and the average number of senses per word is 5, then the joint distribution across the whole sentence, P(M_1, ..., M_N), is represented by the Bayesian network with approximately 5^(2+1) × N parameters. This is in contrast to a full joint distribution table that would contain 5^N entries, which is obviously intractable for any sentence of non-trivial length N. Bayesian networks also facilitate the computation of the instantiation of M such that P(M_1, ..., M_N) is maximal. Instead of searching for the maximum row in a table with 5^N entries, this computation is made tractable by using Bayesian networks.
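The sketch below illustrates the factorization P(X_1, ..., X_n) = ∏_i P(X_i | PA_i) on the network of Figure 1, assuming Boolean nodes and made-up CPT entries: the probability of any full assignment is obtained by multiplying one CPT entry per node, rather than looking up a row of a single 2^6-entry table.

```python
# A minimal sketch of the Figure 1 factorization; all probabilities are invented.
# Each node is Boolean, and PARENTS encodes the directed edges of the graph.
PARENTS = {"A": ("B", "C"), "B": ("D", "F"), "C": ("D",),
           "D": ("E",), "E": (), "F": ()}

# CPTs keyed by (node, node_value, parent_values); set_cpt fills both outcomes.
CPT = {}
def set_cpt(node, parent_values, p_true):
    CPT[(node, True, parent_values)] = p_true
    CPT[(node, False, parent_values)] = 1.0 - p_true

set_cpt("E", (), 0.3)
set_cpt("F", (), 0.6)
set_cpt("D", (True,), 0.8); set_cpt("D", (False,), 0.1)
set_cpt("C", (True,), 0.5); set_cpt("C", (False,), 0.2)
for d in (True, False):
    for f in (True, False):
        set_cpt("B", (d, f), 0.7 if d else 0.25)
for b in (True, False):
    for c in (True, False):
        set_cpt("A", (b, c), 0.9 if (b and c) else 0.3)

def joint_probability(assignment):
    """P(A,B,C,D,E,F) as the product of one CPT entry per node."""
    p = 1.0
    for node, parents in PARENTS.items():
        parent_values = tuple(assignment[pa] for pa in parents)
        p *= CPT[(node, assignment[node], parent_values)]
    return p

print(joint_probability({"A": True, "B": True, "C": False,
                         "D": True, "E": True, "F": False}))
```

The same pattern carries over to the WSD network: the nodes become the sense variables M_i, the parents come from structural information, and the CPT values come from training data.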
Specifically, this query, called Maximum A Posteriori (MAP), can be computed in O(5^w), where w << N and reflects the connectedness of G. Using the same notation as above, whole-sentence word sense disambiguation with probabilistic networks can be described as follows:

    M_best(S) ≈ argmax ∏_i P(M_i | M_PA_i, W_i, W_PA_i)
              ≈ argmax ∏_i P(M_i | M_PA_i) × P(M_i | W_i, W_PA_i).    (2)

The first approximation is based on our hypothesis that a word's sense depends only on its structurally related words. It is further decomposed in the second term to minimize the sparse data problem. This process consists of three major steps: 1) defining the structure of the Bayesian network G, 2) quantifying the network with probabilities from training data (P(M_i | W_i, W_PA_i)), and finally, 3) answering the query for the most probable word sense assignments (argmax ∏_i (...)).
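As an illustration of step 3, the brute-force sketch below enumerates every sense assignment for a hypothetical three-word sentence and scores each with the product from Equation (2). The dependency structure and the two probability functions are stand-ins for the structurally derived parents and the trained tables; a real MAP query is answered with exact inference over the network (cost O(5^w)) rather than enumeration.

```python
from itertools import product

# Hypothetical three-word sentence; each word has two candidate senses.
WORDS = ["great", "devastated", "region"]
SENSES = {"great": (1, 2), "devastated": (1, 2), "region": (1, 2)}
# Made-up structural dependencies: both modifiers depend on the head "devastated".
PARENTS = {0: (1,), 1: (), 2: (1,)}

# Placeholder tables standing in for P(M_i | M_PA_i) and P(M_i | W_i, W_PA_i).
def p_sense_given_parent_senses(i, sense, parent_senses):
    return 0.7 if sense == 2 else 0.3

def p_sense_given_words(i, sense, word, parent_words):
    return 0.8 if sense == 2 else 0.2

def map_assignment():
    """Brute-force argmax over all sense assignments, following Equation (2)."""
    best, best_p = None, -1.0
    for assignment in product(*(SENSES[w] for w in WORDS)):
        p = 1.0
        for i, word in enumerate(WORDS):
            parent_senses = tuple(assignment[j] for j in PARENTS[i])
            parent_words = tuple(WORDS[j] for j in PARENTS[i])
            p *= (p_sense_given_parent_senses(i, assignment[i], parent_senses)
                  * p_sense_given_words(i, assignment[i], word, parent_words))
        if p > best_p:
            best, best_p = assignment, p
    return dict(zip(WORDS, best))

print(map_assignment())  # e.g. {'great': 2, 'devastated': 2, 'region': 2}
```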


Similar Articles

Unsupervised Sense Disambiguation Using Bilingual Probabilistic Models

We describe two probabilistic models for unsupervised word-sense disambiguation using parallel corpora. The first model, which we call the Sense model, builds on the work of Diab and Resnik (2002) that uses both parallel text and a sense inventory for the target language, and recasts their approach in a probabilistic framework. The second model, which we call the Concept model, is a hierarchica...


Learning Probabilistic Models of Word Sense Disambiguation

This dissertation presents several new methods of supervised and unsupervised learning of word sense disambiguation models. The supervised methods focus on performing model searches through a space of probabilistic models, and the unsupervised methods rely on the use of Gibbs Sampling and the Expectation Maximization (EM) algorithm. In both the supervised and unsupervised case, the Naive Bayesi...


Potts Model on the Case Fillers for Word Sense Disambiguation

We propose a new method for word sense disambiguation for verbs. In our method, sense-dependent selectional preference of verbs is obtained through the probabilistic model on the lexical network. The mean-field approximation is employed to compute the state of the lexical network. The outcome of the computation is used as features for discriminative classifiers. The method is evaluated on the da...


Building Instance Knowledge Network for Word Sense Disambiguation

In this paper, a new high precision focused word sense disambiguation (WSD) approach is proposed, which not only attempts to identify the proper sense for a word but also provides the probabilistic evaluation for the identification confidence at the same time. A novel Instance Knowledge Network (IKN) is built to generate and maintain semantic knowledge at the word, type synonym set and instance...


Word Sense Disambiguation of Ambiguous Farsi Words Using the LDA Topic Model

Word sense disambiguation is the task of identifying the correct sense of a word in a given context from a finite set of possible senses. In this paper a model for Farsi word sense disambiguation is presented. The model uses two groups of features: first, all words and stop words around the target word, and second, topic models. We extract topics from a Farsi corpus with Latent Dirichlet ...


Topic Models for Word Sense Disambiguation and Token-Based Idiom Detection

This paper presents a probabilistic model for sense disambiguation which chooses the best sense based on the conditional probability of sense paraphrases given a context. We use a topic model to decompose this conditional probability into two conditional probabilities with latent variables. We propose three different instantiations of the model for solving sense disambiguation problems with dif...




Publication date: 2001